Abstract

The purpose of this paper is to provide further understanding of the structure of the sequential
allocation (“stochastic multi-armed bandit”, or MAB) problem by establishing probability-one
finite-horizon bounds and convergence rates for the sample (or “pseudo”) regret associated
with two simple classes of allocation policies $\pi$.

For any slowly increasing function $g$, subject to mild regularity constraints, we construct
two policies (the $g$-Forcing and the $g$-Inflated Sample Mean) that achieve a measure of regret
of order $O(g(n))$ almost surely as $n \to \infty$, bounded from above and below. Additionally,
almost sure upper and lower bounds on the remainder term are established. In the constructions
herein, the function $g$ effectively controls the “exploration” of the classical
“exploration/exploitation” tradeoff.


Keywords: Forcing Actions, Inflated Sample Means, Multi-armed Bandits, Sequential Allocation, 
Online Learning 


1 Introduction and Summary 

The basic problem involves sampling sequentially from a finite number $K \geq 2$ of populations or
“bandits,” where each population $i$ is specified by a sequence of real-valued i.i.d. random variables,
$\{X^i_k\}_{k \geq 1}$, with $X^i_k$ representing the reward received the $k$-th time population $i$ is
sampled. The distributions $F_i$ of the $X^i_k$ are taken to be unknown; they belong to some collection
of distributions $\mathcal{F}$. We restrict $\mathcal{F}$ in two ways:

The first is that each population $i$ has some finite mean $\mu_i = \mathbb{E}[X^i_k] = \int x \, dF_i(x) < \infty$,
unknown to the controller. The purpose of this assumption is to establish for each population $i$ the
Strong Law of Large Numbers (SLLN),

$$ P\left( \lim_{k} \bar{X}^i_k = \mu_i \right) = 1, \tag{1} $$

where $\bar{X}^i_k = \frac{1}{k} \sum_{j=1}^{k} X^i_j$ denotes the sample mean of the first $k$ samples from population $i$.

Second, we assert that each population has finite variance $\sigma_i^2 = \mathrm{Var}(X^i_k) < \infty$. The
purpose of this assumption is to establish for each population $i$ the Law of the Iterated Logarithm (LIL),

$$ P\left( \limsup_{k} \pm \frac{\bar{X}^i_k - \mu_i}{\sqrt{\ln \ln k / k}} = \sigma_i \sqrt{2} \right) = 1. \tag{2} $$

It will emerge that the important distribution properties for the populations are not the i.i.d.
structure, but rather Eqs. (1) and (2) alone. This allows for some relaxation of assumptions, as
discussed in Section 5. In fact, the LIL (and therefore the assumption of finite variances) is only
required for the derivation of the regret remainder term bounds in the results to follow; the primary
asymptotic results depend solely on the SLLN.

Additionally, we define $\mu^* = \max_i \mu_i$, and we take the optimal bandit to be unique; that is,
there is a unique $i^*$ such that $\mu_{i^*} = \mu^*$. It is convenient to define the bandit
discrepancies $\{\Delta_i\}$ as $\Delta_i = \mu^* - \mu_i \geq 0$.

For any adaptive policy $\pi$, let $\pi(t) = i$ indicate the event that population $i$ is sampled at time
$t$, and let $T^i_\pi(n) = \sum_{t=1}^{n} \mathbb{1}\{\pi(t) = i\}$ denote the number of times $i$ has been
sampled during periods $t = 1, 2, \ldots, n$ under policy $\pi$; for convenience we define
$T^i_\pi(0) = 0$ for all $i$, $\pi$. One is typically interested in maximizing, in some well-defined
sense, the sum of the first $n$ outcomes $S_\pi(n) = \sum_{i=1}^{K} \sum_{k=1}^{T^i_\pi(n)} X^i_k$
achieved by an adaptive policy $\pi$. To this end we note that if the controller had complete information
(i.e., knew the distributions of the $X^i_k$, for each $i$), she would at every round activate the
“optimal” bandit $i^*$. Natural measures of the loss due to this ignorance of the distributions are the
quantities below:


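$$ \tilde{R}_\pi(n) = n\mu^* - \sum_{i=1}^{K} \mu_i T^i_\pi(n) = \sum_{i=1}^{K} \Delta_i T^i_\pi(n), \tag{3} $$

$$ R_\pi(n) = n\mu^* - \mathbb{E}[S_\pi(n)] = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T^i_\pi(n)]. \tag{4} $$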

The functions $\tilde{R}_\pi(n)$ and $R_\pi(n)$ have been called in the literature pseudo-regret and
regret, respectively; for notational simplicity their dependence on the unknown distributions is
usually suppressed.
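
As a small illustration of Eq. (3), the pseudo-regret depends only on the activation counts and the discrepancies, not on the realized rewards. The sketch below is not from the paper; the function name `pseudo_regret` and the example numbers in the note that follows are hypothetical.

```python
def pseudo_regret(means, counts):
    """Pseudo-regret of Eq. (3): sum over bandits of Delta_i * T_i(n).

    means[i]  -- true mean mu_i of bandit i (known to the analyst, not the controller)
    counts[i] -- number of activations T_i(n) of bandit i up to time n
    """
    if len(means) != len(counts):
        raise ValueError("means and counts must have the same length")
    mu_star = max(means)  # optimal mean mu*
    # Each activation of bandit i costs its discrepancy Delta_i = mu* - mu_i.
    return sum((mu_star - mu) * t for mu, t in zip(means, counts))
```

For instance, with means (1.0, 0.5, 0.2) and activation counts (10, 4, 2), the two sub-optimal bandits contribute 0.5 · 4 + 0.8 · 2 = 3.6, while the optimal bandit contributes nothing.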

The motivation for considering alternative regret measures to $R_\pi(n)$ is that while the
investigator might be pleased to know that the policy she is utilizing has minimal expected regret,
she might reasonably be more interested in the behavior of the policy on the specific sample path she
is currently exploring rather than aggregate behavior over the entire probability space. At an extreme
end of this would be a result minimizing regret or pseudo-regret surely (sample-path-wise) or almost
surely (with full probability), guaranteeing a sense of optimality independent of outcome. We offer
an asymptotic result of this type here in Theorem 2.
Note that $\mathbb{E}[\tilde{R}_\pi(n)] = R_\pi(n)$, and “good policies” are those that achieve a small
rate of increase for one of the above regret functions. Further relationships and forms of
pseudo-regret are explored in Bubeck and Cesa-Bianchi, e.g., the “sample regret”
$R'_\pi(n) = n\mu^* - S_\pi(n) = n\mu^* - \sum_{i=1}^{K} \sum_{k=1}^{T^i_\pi(n)} X^i_k$.

We find the pseudo-regret $\tilde{R}_\pi(n) = n\mu^* - \sum_{i=1}^{K} \mu_i T^i_\pi(n)$ in some sense more
philosophically satisfying to consider than sample regret, for the reason that, given her ignorance and
the inherent randomness, the controller cannot reasonably regret the specific reward gained or lost
from an activation of a bandit, as in $R'_\pi(n)$. She can only reasonably regret the decision to
activate that specific bandit, which is captured by the dependence of $\tilde{R}_\pi(n)$ on the
$T^i_\pi(n)$ alone.

Thus, we are particularly interested in high probability or guaranteed (almost sure) asymptotic
bounds on the growth of the pseudo-regret as $n \to \infty$. The main result of this paper is Theorem 1,
which establishes, by two examples, that for any arbitrarily (slowly) increasing function $g(n)$,
e.g., $g(n) = \ln \ln \ldots \ln n$, that satisfies mild regularity conditions, there exist
“$g$-good policies” $\pi_g$. The latter policies are such that the following is true:

$$ \tilde{R}_{\pi_g}(n) = C_{\pi_g}(\{F_i\}) \, g(n) + o(g(n)), \quad \text{as } n \to \infty $$

(i.e., $\tilde{R}_{\pi_g}(n) = O(g(n))$ (a.s.), as $n \to \infty$) for every set of bandit distributions
$\{F_i\} \subset \mathcal{F}$, for some positive finite constant $C_{\pi_g}(\{F_i\})$.

The results presented here are in fact intuitive, in the following way: it will be shown that in the 
g-Forcing and g-ISM index policies, the function g essentially sets the investigator’s willingness to 
explore and experiment with bandits that do not currently (based on available data) seem to have 
the highest mean. Even if the controller explores very slowly (i.e., she chose a very slow growing 
g), as long as she explores long enough she will eventually develop accurate estimates of the means 
for each bandit, and incur very little regret (or pseudo-regret) past that point. We note here that, 
for the most part, we do not recommend the actual implementation or use of these policies. The 
cost of this guaranteed asymptotic behavior is that (depending on g and the bandit specifics), slow 
pseudo-regret growth is only achieved on impractically large time-scales. We find it interesting, 
however, that such growth can be guaranteed - independent of the specifics of the bandits! - with 
as weak assumptions as the Strong Law of Large Numbers. This makes these results fairly broad. 
Additionally, the g-Forcing and g-ISM index policies individually capture elements present in many
other popular policies, and are suggestive of the almost sure asymptotic behavior of these policies.
One takeaway from this is, perhaps, to emphasize that asymptotic behavior by itself is little basis 
for thinking of a policy as “good”. As essentially any asymptotic behavior is possible (through the 
choice of g), any useful qualification of a policy must consider not only the asymptotic behavior, 
but also the timescales over which it is practically achieved. 

In the remainder of the paper, we define what it means for a policy to be g-good (Definition 1), and
establish the existence of g-good policies (Theorem 1) for any g satisfying mild regularity
conditions. The proof is by example, through the construction of g-Forcing and g-ISM index policies
that satisfy its claim. Further, bounds on the corresponding order constants of pseudo-regret growth
are established for each policy (Theorems 2 and 4), as well as bounds on the asymptotic remainder
terms (Theorems 3 and 5), bounding the remainder from both above and below. We view the
proofs of the asymptotic lower bounds, as well as the derivation of the remainder terms via a sort of
bootstrapping on the earlier order results, as particularly interesting.

In the attempt to generalize some of these results for the g-ISM index policy, an interesting effect
and seeming “phase change” in the resulting dynamics was discovered. Specifically, as discussed in
Remark 2, when there are multiple optimal bandits, for $g$ of order greater than $\sqrt{n \ln \ln n}$ all
optimal bandits are sampled roughly equally often, while for $g$ of order less than
$\sqrt{n \ln \ln n}$, the g-ISM index policy tends to fix on a single optimal bandit, sampling the other
optimal bandits much more rarely in comparison.

2 Related Literature 

Robbins first analyzed the problem of maximizing asymptotically the expected value of the
sum $S_\pi(n)$. Using only the assumption of the Strong Law of Large Numbers for $\mathcal{F}$, for
$K = 2$, he constructed a modified (outside two sparse sequences of forced choices) “play the winner”
(greedy) policy, $\pi_R$, such that with probability one, as $n \to \infty$, $S_{\pi_R}(n)/n \to \mu^*$.
From this he was able to claim, using the uniform integrability property for the case of Bernoulli
bandits, that

$$ R_{\pi_R}(n) = o(n), \quad \text{as } n \to \infty. \tag{5} $$

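Lai and Robbins considered the case in which the collection of distributions $\mathcal{F}$ consists of
univariate density functions $f(x; \theta_i)$ with respect to some measure $\nu$, where $f(\cdot\,;\cdot)$
is known and the unknown scalar parameter $\theta_i$ is in some known set $\Theta$. Let
$\mu_i = \mu(\theta_i) = \mathbb{E}[X^i_k]$, $\mu^* = \max_i\{\mu(\theta_i)\} = \mu(\theta^*)$,
$\Delta_i(\theta_i) = \mu(\theta^*) - \mu(\theta_i)$, and let
$I(\theta \| \theta') = \int \ln \frac{f(x;\theta)}{f(x;\theta')} \, f(x;\theta) \, d\nu(x)$ denote the
Kullback-Leibler divergence between $f(x;\theta)$ and $f(x;\theta')$. They established, under mild
regularity conditions ((1.6), (1.7) and (1.9) therein), that if one requires a policy to have a regret
that increases more slowly than any positive power of $n$:

$$ R_\pi(n) = o(n^\alpha), \quad \forall \alpha > 0, \text{ as } n \to \infty, \ \forall \{\theta_i\} \subset \Theta, \tag{6} $$

then $\pi$ must sample among populations in such a way that its regret satisfies

$$ \liminf_{n} \frac{R_\pi(n)}{\ln n} \geq M_{LR}(\theta_1, \ldots, \theta_K), \quad \forall \{\theta_i\} \subset \Theta, \tag{7} $$

where

$$ M_{LR}(\theta_1, \ldots, \theta_K) = \sum_{i : \mu(\theta_i) \neq \mu^*} \frac{\Delta_i(\theta_i)}{I(\theta_i \| \theta^*)}. $$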

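Burnetas and Katehakis extended and simplified the above work for the case in which the
collection of distributions $\mathcal{F}$ is specified by a known function $f(x; \underline{\theta}_i)$
that may depend on an unknown vector parameter $\underline{\theta}_i \in \Theta_i$, as follows. Let
$\underline{\theta} := (\underline{\theta}_1, \ldots, \underline{\theta}_K) \in \Theta = \Theta_1 \times \cdots \times \Theta_K$,
$\mu^* = \mu(\underline{\theta}^*) = \max_i\{\mu(\underline{\theta}_i)\}$,
$\Delta_i(\underline{\theta}) = \mu^* - \mu(\underline{\theta}_i)$. They showed, under certain regularity
conditions (part 1 of Theorem 1, therein), that if a policy satisfies Eq. (6),
$\forall \underline{\theta} \in \Theta$, then it must sample among populations in such a way that its
regret satisfies:

$$ \liminf_{n} \frac{R_\pi(n)}{\ln n} \geq M_{BK}(\underline{\theta}), \quad \forall \underline{\theta} \in \Theta, \tag{8} $$

where

$$ M_{BK}(\underline{\theta}) = \sum_{i \in B(\underline{\theta})} \frac{\Delta_i(\underline{\theta})}{\inf_{\underline{\theta}'_i} \left\{ I(\underline{\theta}_i \| \underline{\theta}'_i) : \mu(\underline{\theta}'_i) > \mu(\underline{\theta}^*) \right\}}. \tag{9} $$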
Further, under certain regularity conditions (cf. conditions “A1-A3” therein) regarding the estimates
$\hat{\underline{\theta}}_i = \hat{\underline{\theta}}_i(X^i_1, \ldots, X^i_{T^i_\pi(n)})$ of the parameters
$\underline{\theta}_i$, $f(\cdot\,;\cdot)$ and $\Theta_i$, they showed that policies which, after
taking some small number of samples from each population, always choose the population $\pi^0(n)$
with the largest value of the population-dependent index:

$$ u_i(\hat{\underline{\theta}}_i) = \sup_{\underline{\theta}'_i \in \Theta_i} \left\{ \mu(\underline{\theta}'_i) : I(\hat{\underline{\theta}}_i \| \underline{\theta}'_i) < \frac{\ln n + o(\ln n)}{T^i_{\pi^0}(n)} \right\} \tag{10} $$

are asymptotically efficient (or optimal), i.e.,

$$ \limsup_{n} \frac{R_{\pi^0}(n)}{\ln n} \leq M_{BK}(\underline{\theta}), \quad \forall \underline{\theta} \in \Theta. \tag{11} $$

The index policy $\pi^0$ above was a simplification of a UCB-type policy first introduced in Lai and
Robbins [9] that utilized forced actions. Policies that satisfy the requirements of Eq. (5), Eq. (6), and
Eq. (11) were respectively called uniformly consistent (UC), uniformly fast convergent (UF), and
uniformly maximal convergence rate (UM) or simply asymptotically optimal (or asymptotically
efficient). The lower bound of Eq. (9) provides a baseline for comparison of the quality of policies
and, together with Eq. (11) and Eq. (8), provides an alternative way to state the asymptotic optimality
of a policy $\pi^0$ as:

$$ R_{\pi^0}(n) = M_{BK}(\underline{\theta}) \ln n + o(\ln n), \quad \forall \underline{\theta} \in \Theta. \tag{12} $$

Policies that achieve this minimal asymptotic growth rate have been derived for specific parametric
models in Lai and Robbins, Burnetas and Katehakis, Honda and Takemura, Cowan et al., and references
therein. In general it is not always easy to obtain such optimal policies; thus, policies that satisfy
the less strict requirement of Eq. (6), $\forall \underline{\theta} \in \Theta$, have been constructed,
cf. Auer et al. [2], Audibert et al. [1], Bubeck and Cesa-Bianchi, and references therein. Such
policies usually bound the regret as follows:

$$ R_\pi(n) \leq M^0(\underline{\theta}) \ln n + M^1(\underline{\theta}), \quad \text{for all } n \text{ and all } \underline{\theta} \in \Theta, \tag{13} $$

where $M^0(\underline{\theta})$ is often much bigger than $M_{BK}(\underline{\theta})$, for all $\underline{\theta}$.


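The results presented herein can seem surprising, and may appear to contradict (at least for
$g(n) = \ln n$) the classical lower bound $M_{BK}(\underline{\theta})$ on $R_\pi(n)/\ln n$ for UF policies
$\pi$. For example, if we take $F_i$ to be the normal distribution with unknown mean $\mu_i$ and unknown
variance $\sigma_i^2$, we have for any UF policy $\pi$:

$$ \liminf_{n} \frac{R_\pi(n)}{\ln n} \geq M_{BK}(\underline{\mu}, \underline{\sigma}^2) = \sum_{i : \mu_i \neq \mu^*} \frac{2 \Delta_i}{\ln\left(1 + \Delta_i^2 / \sigma_i^2\right)}. $$

On the other hand we establish in the sequel that:

$$ \lim_{n} \frac{\tilde{R}_{\pi^F_g}(n)}{g(n)} = C_{\pi^F_g}(\{F_i\}) = \sum_{i} \Delta_i \quad \text{(a.s.)}, $$

$$ \lim_{n} \frac{\tilde{R}_{\pi^0_g}(n)}{g(n)} = C_{\pi^0_g}(\{F_i\}) = K - 1 \quad \text{(a.s.)}. \tag{14} $$
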
However, no such contradiction exists: $M_{BK}(\underline{\theta})$ bounds
$\liminf_n \mathbb{E}[\tilde{R}_\pi(n)]/\ln n$ of a UF policy from below. In such contexts that
$\pi^F_g$ or $\pi^0_g$ are UF, if such contexts exist, the above constants will be bounded from below by
$M_{BK}(\underline{\theta})$. In such contexts that $\pi^F_g$ or $\pi^0_g$ are not UF, the bound does not
apply. In such instances, we may in fact conclude from the results presented herein, and standard
results relating modes of convergence, that for the policies constructed here, for $g(n) = O(\ln n)$,
the sequences of random variables $\tilde{R}_{\pi^F_g}(n)/g(n)$, $\tilde{R}_{\pi^0_g}(n)/g(n)$ are not
uniformly integrable. An example as to how this can occur is given via the proof of Theorem 2 of
Cowan et al., where with a non-trivial probability, non-representative initial sampling of each bandit
biases expected future activations of sub-optimal bandits super-logarithmically. This effect does not
influence the long-term almost sure behavior of these policies.
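
To make the comparison concrete, the normal-distribution lower-bound constant quoted in this section, $\sum_{i : \mu_i \neq \mu^*} 2\Delta_i / \ln(1 + \Delta_i^2/\sigma_i^2)$, can be evaluated numerically. The following is a minimal sketch; the function name `normal_lower_bound_constant` and the example parameters are illustrative assumptions, not part of the paper.

```python
import math

def normal_lower_bound_constant(means, variances):
    """Evaluate the sum over sub-optimal i of 2*Delta_i / ln(1 + Delta_i^2 / sigma_i^2).

    means[i], variances[i] -- mean and (positive) variance of normal bandit i.
    """
    if len(means) != len(variances):
        raise ValueError("means and variances must have the same length")
    mu_star = max(means)
    total = 0.0
    for mu, var in zip(means, variances):
        delta = mu_star - mu
        if delta > 0:  # only sub-optimal bandits contribute
            total += 2.0 * delta / math.log(1.0 + delta * delta / var)
    return total
```

With two unit-variance bandits whose means differ by 1, the constant is $2/\ln 2 \approx 2.885$, whereas for $g(n) = \ln n$ both almost sure order constants in (14) equal 1 in that example. This is the apparent (but not actual) contradiction that the paragraph above resolves.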


3 Main Theorems 

We characterize a policy by the rate of growth of its pseudo-regret function $\tilde{R}_\pi(n)$ with $n$
in the following way.


Definition 1 For a function $g(n)$, a policy $\pi$ is g-good if for every set of bandit distributions
$\{F_i\} \subset \mathcal{F}$, there exists a constant $C_\pi(\{F_i\}) < \infty$ such that

$$ \limsup_{n} \frac{\tilde{R}_\pi(n)}{g(n)} \leq C_\pi(\{F_i\}) \quad \text{(a.s.), as } n \to \infty. \tag{15} $$



Remark 1: Essentially, a policy is g-good if $\tilde{R}_\pi(n) \leq O(g(n))$ (a.s.), $n \to \infty$.
Trivially, policies exist that are $n$-good (i.e., $\tilde{R}_\pi(n) \leq O(n)$ (a.s.)), for example any
policy that samples all populations at constant rate $1/K$.

We next state the following theorem: 


Theorem 1 For g, an unbounded, positive, increasing, concave, differentiable, sub-linear function, 
there exist g-good policies. 


The proof of this theorem is given by example with Theorems 2 and 4, which demonstrate two g-good
policies: the g-Forcing and the g-ISM index policies.

We note that in the sequel it will be assumed that any g considered is an unbounded, positive, 
increasing, concave, differentiable, sub-linear function. 


3.1 A Class of g-Forcing Policies 

Let $g$ be as hypothesized in Theorem 1. We define a g-Forcing policy $\pi^F_g$ in the following way:

g-Forcing policy: A policy $\pi^F_g$ that first samples each bandit once, then for $t \geq K$,

$$ \pi^F_g(t+1) = \begin{cases} \arg\max_i \bar{X}^i_{T^i_{\pi^F_g}(t)} & \text{if } \min_i T^i_{\pi^F_g}(t) \geq g(t), \\ \arg\min_i T^i_{\pi^F_g}(t) & \text{else.} \end{cases} \tag{16} $$



Briefly, at any time, if any population has been sampled fewer than g(t) times, sample it. Otherwise, 
sample from the population with the current highest sample mean. Ties are broken either uniformly 
at random, or at the discretion of the investigator. In this way, g can be seen as determining the rate 
of exploration of currently sub-optimal bandits. This can be viewed as a variant on the policy $\pi_R$
considered in Robbins.
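
The rule in (16) can be sketched in a few lines of Python. This is an illustrative simulation rather than the authors' code: the function name `g_forcing`, the reward callback `draw`, the example exploration function `sqrt_g`, and the lowest-index tie-breaking are all assumptions made for the example.

```python
import math

def sqrt_g(t):
    """Example exploration function g(t) = sqrt(t) (an illustrative choice)."""
    return math.sqrt(t)

def g_forcing(draw, num_bandits, g, horizon):
    """Run a g-Forcing policy for `horizon` total samples.

    draw(i) -- returns one reward from bandit i (hypothetical callback)
    g(t)    -- exploration function evaluated at the current time t
    Returns the list of activation counts T_i(horizon).
    """
    if horizon < num_bandits:
        raise ValueError("horizon must be at least the number of bandits")
    counts = [0] * num_bandits
    sums = [0.0] * num_bandits

    def activate(i):
        counts[i] += 1
        sums[i] += draw(i)

    for i in range(num_bandits):           # initial round: each bandit once
        activate(i)

    for t in range(num_bandits, horizon):  # t samples have been taken so far
        if min(counts) >= g(t):
            # "Play the winner": largest sample mean (ties go to the lowest index).
            i = max(range(num_bandits), key=lambda j: sums[j] / counts[j])
        else:
            # Forcing: activate a least-sampled bandit (ties go to the lowest index).
            i = min(range(num_bandits), key=lambda j: counts[j])
        activate(i)
    return counts
```

Passing, for example, `draw = lambda i: random.gauss(means[i], 1.0)` gives a stochastic run. With a deterministic `draw` that returns the mean itself, sub-optimal bandits are activated only by forcing, so their counts never exceed $\lceil g(n) \rceil$, which mirrors the upper bound established for this policy in Proposition 1.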

It is convenient to define the following constant,

$$ S_\Delta = \sum_{i=1}^{K} \Delta_i. \tag{17} $$

The value $S_\Delta$ in some sense represents the pseudo-regret incurred each time the sub-optimal
bandits are all activated once. The next result states that g-Forcing policies satisfy the conditions
of Theorem 1.


Theorem 2 For a policy $\pi^F_g$ as in (16), $\pi^F_g$ is g-good, and

$$ P\left( \lim_{n} \frac{\tilde{R}_{\pi^F_g}(n)}{g(n)} = S_\Delta \right) = 1. \tag{18} $$



The above theorem can be strengthened in the following way, bounding the asymptotic remainder 
terms almost surely: 


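Theorem 3 For a policy $\pi^F_g$ as in (16), the following are true:

$$ P\left( \limsup_{n} \left( \tilde{R}_{\pi^F_g}(n) - S_\Delta \, g(n) \right) \leq S_\Delta \right) = 1 \tag{19} $$

and

$$ P\left( \liminf_{n} \left( \tilde{R}_{\pi^F_g}(n) - S_\Delta \, g(n) \right) \geq 0 \right) = 1. \tag{20} $$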

Proof. [Theorems 2 and 3] Theorems 2 and 3 follow immediately from the following proposition, the
proof of which is given in Appendix A.


Proposition 1 For policy $\pi^F_g$ as in (16), the following is true: For every $\varepsilon > 0$, almost
surely there exists an $N_\varepsilon < \infty$ such that, for all $n \geq N_\varepsilon$,

$$ g(n) S_\Delta - \varepsilon \leq \tilde{R}_{\pi^F_g}(n) \leq \lceil g(n) \rceil S_\Delta. \tag{21} $$

Using the above relation to bound first the limits as $n \to \infty$ of $\tilde{R}_{\pi^F_g}(n)/g(n)$, then
$\tilde{R}_{\pi^F_g}(n) - S_\Delta g(n)$ (observing that $\lceil g(n) \rceil - g(n) \leq 1$), gives the
desired results.


Proposition 1 is considerably stronger than Theorems 2 and 3. However, it somewhat obscures the true
nature of what is going on: for sufficiently large $n$, almost surely, sub-optimal bandits
($i : \mu_i \neq \mu^*$) are only activated during the “forcing” phase of the policy, when some
activations are below $g$. As a result, since $g$ increases slowly (in particular, sub-linearly), for
large $n$, $T^i_{\pi^F_g}(n) = \lceil g(n) \rceil$, except for a discrepancy that occurs, for a brief
stretch ($< K$) of activations, whenever $g$ surpasses the next integer threshold. At this point, the
policy raises the activations of each sub-optimal bandit, restoring the previous equality. Hence, in
fact, equality holds in Proposition 1 ($\tilde{R}_{\pi^F_g}(n) = \lceil g(n) \rceil S_\Delta$) for most
large $n$. Discrepancy occurs increasingly rarely with $n$, based on the hypotheses on $g$. If,
additionally, the controller specifies a deterministic scheme for tie-breaking, pseudo-regret may be
determined explicitly for all sufficiently large $n$. Leaving ties to the discretion of the controller,
Proposition 1 is as strong a statement as can be made.

3.2 A Class of g-Index Policies 

In this section, we consider an index policy related to the classical “UCB” index policies. Let $g$ be
as hypothesized. For each $i$, define an index on $(j, k) \in \mathbb{Z}^2$,

$$ u_i(j, k) = \bar{X}^i_k + \frac{g(j)}{k}. \tag{22} $$


g-ISM index policy: A policy $\pi^0_g$ that first samples each bandit once, then for $t \geq K$,

$$ \pi^0_g(t+1) = \arg\max_i u_i\left(t, T^i_{\pi^0_g}(t)\right) = \arg\max_i \left\{ \bar{X}^i_{T^i_{\pi^0_g}(t)} + \frac{g(t)}{T^i_{\pi^0_g}(t)} \right\}. \tag{23} $$


Briefly, at any time, the sample means of each bandit are “inflated” by the $g(t)/T^i_{\pi^0_g}(t)$ term,
and the policy always activates the bandit with the largest inflated sample mean. When unsampled, a
bandit’s inflated sample mean increases essentially at rate $g$, hence $g$ drives the rate of
exploration of current sub-optimal bandits. While this policy is inspired by more traditional “Upper
Confidence Bound” policies, we refer to this as an Inflated Sample Mean policy, as it has no deliberate
connection to confidence bounds.
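
The index rule (23) admits an equally short sketch. As before, this is an illustration under stated assumptions rather than the authors' implementation: `g_ism`, the reward callback `draw`, the example exploration function `sqrt_g`, and the lowest-index tie-breaking are hypothetical choices.

```python
import math

def sqrt_g(t):
    """Example exploration function g(t) = sqrt(t) (an illustrative choice)."""
    return math.sqrt(t)

def g_ism(draw, num_bandits, g, horizon):
    """Run a g-ISM index policy for `horizon` total samples.

    draw(i) -- returns one reward from bandit i (hypothetical callback)
    g(t)    -- exploration function evaluated at the current time t
    Returns the list of activation counts T_i(horizon).
    """
    if horizon < num_bandits:
        raise ValueError("horizon must be at least the number of bandits")
    counts = [0] * num_bandits
    sums = [0.0] * num_bandits

    def activate(i):
        counts[i] += 1
        sums[i] += draw(i)

    for i in range(num_bandits):           # initial round: each bandit once
        activate(i)

    for t in range(num_bandits, horizon):  # t samples have been taken so far
        g_t = g(t)
        # Inflated sample mean: sample mean plus g(t) / T_i(t); ties go to the lowest index.
        i = max(range(num_bandits),
                key=lambda j: sums[j] / counts[j] + g_t / counts[j])
        activate(i)
    return counts
```

With a deterministic `draw` returning means 1.0 and 0.5 and $g(t) = \sqrt{t}$, the sub-optimal bandit's count tracks roughly $g(n)/\Delta_i = 2\sqrt{n}$, in line with the limit $T^i_{\pi^0_g}(n)/g(n) \to 1/\Delta_i$ obtained in the proof of Theorem 4.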


More general index policies of this type could also be considered, for instance based on an index
$\bar{X}^i_k + H_i(g(j)/k)$, where $H_i$ is some positive, increasing function of its argument. This is
more in line with the common UCB policies, which frequently have inflation terms of the form
$O\left(\sqrt{\ln n / T^i_\pi(n)}\right)$ (though this is hardly necessary, cf. Cowan et al.), with
$\ln n$ serving the “exploration-driving” role of $g$. However, introducing this extra $H_i$ function
does not influence the order of the growth of pseudo-regret; it simply changes the relevant order
constants, at the cost of complicating the analysis.
Theorem 4 below shows that a g-ISM index policy satisfies the conditions of Theorem 1 and gives
the minimal order constant $C_{\pi^0_g}$ for this policy.


Theorem 4 For a policy $\pi^0_g$ as in (23), if the optimal bandit is unique,

$$ P\left( \lim_{n} \frac{\tilde{R}_{\pi^0_g}(n)}{g(n)} = K - 1 \right) = 1. \tag{24} $$

The proof of this theorem depends on the following propositions, the proofs of which are given in
Appendix B. Interestingly, these results (and therefore Theorem 4) depend only on the assumption
of the SLLN, not the LIL.


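Proposition 2 For each sub-optimal $i$, $\forall \varepsilon \in (0, \Delta_i/2)$, $\exists$ (a.s.) a
finite constant $C^i_\varepsilon$ such that for $n \geq K$,

$$ T^i_{\pi^0_g}(n) \leq \frac{g(n)}{\Delta_i - 2\varepsilon} + C^i_\varepsilon. \tag{25} $$


Proposition 3 For each sub-optimal $i \neq i^*$, $\forall \varepsilon \in (0, \min_{j \neq i^*} \Delta_j/2)$,
$\exists$ (a.s.) some finite $N'$ such that for $n \geq N'$,

$$ \frac{g(n)}{(1 + \varepsilon)(\Delta_i + 2\varepsilon) + 2\varepsilon} \leq T^i_{\pi^0_g}(n). \tag{26} $$
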


Proof. [Theorem 4] For each sub-optimal bandit $i$, as an application of Props. 2 and 3, taking the
limit of $T^i_{\pi^0_g}(n)/g(n)$ first as $n \to \infty$, then as $\varepsilon \to 0$, gives
$\lim_n T^i_{\pi^0_g}(n)/g(n) = 1/\Delta_i$, almost surely. The theorem then follows from the definition
of pseudo-regret, Eq. (3): each of the $K - 1$ sub-optimal bandits contributes
$\Delta_i T^i_{\pi^0_g}(n) \approx g(n)$ asymptotically.


Remark 2: In the case that the optimal bandit is not unique, it happens that Prop. 2 still holds.
It can be shown then that $\pi^0_g$ remains g-good in this case, and has a limiting order constant of
at most $K - K^*$ (with $K^*$ the number of optimal bandits). We leave as an open question, however,
that of producing a Prop. 3-type lower bound and the verification of $K - K^*$ as the minimal order
constant. The proof of Prop. 3 for $K^* = 1$ depends on establishing a lower bound on the activations
of the unique optimal bandit: in short, at time $n$, since the sub-optimal bandits are activated at most
$O(g(n))$ times (which holds independent of $K^*$), it follows from its uniqueness that the optimal
bandit is activated at least $n - O(g(n))$ times. If, however, $K^* > 1$ and the optimal bandit is not
unique, while the optimal bandits must have been activated at least $n - O(g(n))$ times in total at
time $n$, the distribution of these activations among the optimal bandits is hard to pin down. Simple
simulations seem to indicate a sort of “phase change”, in that for $g$ of order greater than
$\sqrt{n \ln \ln n}$ all optimal bandits are sampled roughly equally often, while for $g$ of order less
than $\sqrt{n \ln \ln n}$ the policy tends to fix on a single optimal bandit, sampling the other optimal
bandits much more rarely in comparison.

We offer the following as a potential explanation of this observed effect (and justification of the
difficult to observe $\ln \ln n$ term): Let us hypothesize, for the moment, that under any
circumstances, the optimal bandits are activated linearly with time, that is, for any optimal $i^*$,
$T^{i^*}_{\pi^0_g}(n) = O(n)$, with the order coefficient depending on the specifics of that bandit.
Under policy $\pi^0_g$, activations are governed by a comparison of indices. We consider then the
fluctuations in value of the two terms of the index, the sample mean
$\bar{X}^{i^*}_{T^{i^*}_{\pi^0_g}(n)}$, and the inflation term $g(n)/T^{i^*}_{\pi^0_g}(n)$. Under the
assumption that the optimal bandits are activated linearly, and reasonable assumptions on the bandit
distributions (to grant the Law of the Iterated Logarithm), the fluctuations in the sample mean over
time will be of order $O(\sqrt{\ln \ln n / n})$. The fluctuations in the inflation term will be of order
$O(g(n)/n)$. It would seem to follow then that for $g$ of order less than $O(\sqrt{n \ln \ln n})$, when
comparing indices of optimal bandits, the sample mean is the dominant contribution to the index, while
for $g$ of order greater than $O(\sqrt{n \ln \ln n})$, the inflation term is the dominant contribution
to the index. When the inflation term dominates, among the optimal bandits an “activate according to
the largest index” policy essentially reduces to an “activate according to the smallest number of
activations” policy, which leads to equalization and all optimal bandits being activated roughly
equally often. When the sample mean dominates, among the optimal bandits an “activate according to the
largest index” policy essentially reduces to an “activate according to the highest sample mean” or
“play the winner” policy, which leads to the policy fixing on certain bandits for long periods.

This explanation would additionally suggest that on one side of the phase change, when the inflation
term dominates, the only properties of the optimal bandits that matter for the dynamics of the
problem are their means, namely that they all have the optimal mean $\mu^*$. But on the other side of
the phase change, when the sample mean dominates, other properties such as the variances
$\{\sigma_i^2\}$ influence the dynamics, through the Law of the Iterated Logarithm. At this point in
time, however, this explanation remains speculative, if interesting.

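Based on the above results, we have the following: For each $i \neq i^*$, $\forall \varepsilon > 0$,
$\exists$ (a.s.) some finite $N_\varepsilon$ such that for $n \geq N_\varepsilon$,

$$ (1 - \varepsilon) \frac{g(n)}{\Delta_i} \leq T^i_{\pi^0_g}(n) \leq (1 + \varepsilon) \frac{g(n)}{\Delta_i}. \tag{27} $$

Similarly, for the optimal bandit $i^*$,

$$ n - (1 + \varepsilon) \sum_{i \neq i^*} \frac{g(n)}{\Delta_i} \leq T^{i^*}_{\pi^0_g}(n) \leq n - (1 - \varepsilon) \sum_{i \neq i^*} \frac{g(n)}{\Delta_i}. \tag{28} $$

It follows trivially from these that each bandit is activated infinitely often, i.e., almost surely
$\{T^i_{\pi^0_g}(n)\}_{n \geq 1}$ is equivalent to the sequence $\{0, 1, \ldots\}$, though with some
(finite) stretches of term repetition. It follows then, applying the LIL, that

$$ P\left( \limsup_{n} \pm \frac{\bar{X}^i_{T^i_{\pi^0_g}(n)} - \mu_i}{\sqrt{\ln \ln T^i_{\pi^0_g}(n) / T^i_{\pi^0_g}(n)}} = \sigma_i \sqrt{2} \right) = 1. \tag{29} $$
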


This provides greater control over the sample mean of each bandit than what the Strong Law of
Large Numbers alone allows, and allows the previous asymptotic results to be strengthened, as in the
following theorem.


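Theorem 5 For a policy $\pi^0_g$ as in (23), the following are true:

a) if $g(n) = o(n / \ln \ln n)$,

$$ P\left( \limsup_{n} \frac{\tilde{R}_{\pi^0_g}(n) - (K - 1) g(n)}{\sqrt{g(n) \ln \ln g(n)}} \leq 2\sqrt{2} \sum_{i \neq i^*} \frac{\sigma_i}{\sqrt{\Delta_i}} \right) = 1, \tag{30} $$

b) if $g(n) = o(n^{2/3})$,

$$ P\left( \liminf_{n} \frac{\tilde{R}_{\pi^0_g}(n) - (K - 1) g(n)}{\sqrt{g(n) \ln \ln g(n)}} \geq -3\sqrt{2} \sum_{i \neq i^*} \frac{\sigma_i}{\sqrt{\Delta_i}} \right) = 1. \tag{31} $$
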

In short, we have that for a g-ISM index policy $\pi^0_g$,

$$ \tilde{R}_{\pi^0_g}(n) = (K - 1) g(n) + O\left( \sqrt{g(n) \ln \ln g(n)} \right). $$

It should be observed that, unlike previous results, this theorem is somewhat restrictive in its allowed
$g$. However, since the focus is traditionally on logarithmic regret, i.e., $g(n) = O(\ln n)$, it is
clear that the above restrictions are nothing serious.

This theorem follows trivially from the following refinements of Props. 2 and 3 and the definition of
pseudo-regret, Eq. (3). Their proofs are given in the appendix.


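Proposition 4 If $g(n) = o(n / \ln \ln n)$, for each sub-optimal $i \neq i^*$, the following holds almost
surely:

$$ \limsup_{n} \frac{\Delta_i T^i_{\pi^0_g}(n) - g(n)}{\sqrt{g(n) \ln \ln g(n)}} \leq \frac{2 \sigma_i \sqrt{2}}{\sqrt{\Delta_i}}. \tag{32} $$


Proposition 5 If $g(n) = o(n^{2/3})$, for each sub-optimal $i \neq i^*$, the following holds almost surely:

$$ \liminf_{n} \frac{\Delta_i T^i_{\pi^0_g}(n) - g(n)}{\sqrt{g(n) \ln \ln g(n)}} \geq -\frac{3 \sigma_i \sqrt{2}}{\sqrt{\Delta_i}}. \tag{33} $$
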


Again, we leave as an open problem that of extending these results to the case of non-unique optimal 
bandits. 


4 Comparison between Policies 

We have established two policies, g-Forcing and g-ISM index, that each achieve $O(g(n))$
pseudo-regret, almost surely. The question of which policy is “better” is not necessarily well posed.
For one thing, the asymptotic pseudo-regret growth of either policy can be improved by picking a slower
$g$. In this sense, there is certainly no “optimal” policy as there will always be a slower $g$. For a
fixed $g$, however, the question of which policy is better becomes context specific: for some bandit
distributions, the order constant of the g-Forcing policy, $S_\Delta$, will be smaller than the order
constant of the g-ISM index policy, $K - 1$; for some bandit distributions, the comparison will go the
other way.

In terms of the results presented here, the pseudo-regret of the g-Forcing policy is much more tightly
controlled: Proposition 1 bounds the fluctuations in pseudo-regret around $S_\Delta g(n)$ by at most a
constant, indeed at most $S_\Delta$. The bounds on the g-ISM index policy, however, are
$O(\sqrt{g(n) \ln \ln g(n)})$. But this additional control of the g-Forcing policy comes at a cost. It
follows from the proof of Proposition 1 that for sub-optimal $i$, for all large $n$,

$$ T^i_{\pi^F_g}(n) \approx g(n). \tag{34} $$

However, for the g-ISM index policy, following the proof of Theorem 4, for all sub-optimal $i$ and
large $n$,

$$ T^i_{\pi^0_g}(n) \approx \frac{g(n)}{\Delta_i}. \tag{35} $$



It is clear from this that the g-Forcing policy is in some sense the more democratic of the two, 
sampling all sub-optimal bandits equally, regardless of quality. The g-ISM index policy is the more 
meritocratic, sampling sub-optimal bandits more rarely the farther they are from the optimum. This 
has the effect of boosting the sampling of bandits near the optimum, but this effect is somewhat 
counterbalanced as they contribute less to the pseudo-regret. 
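
The context dependence of the comparison can be made concrete by computing both order constants for a given set of means. This is a small illustrative helper, not part of the paper; the name `order_constants` and the example means are assumptions.

```python
def order_constants(means):
    """Return (S_Delta, K - 1): the almost sure order constants of the
    g-Forcing and g-ISM index policies, assuming a unique optimal bandit.
    """
    mu_star = max(means)
    # Discrepancies Delta_i of the sub-optimal bandits only.
    gaps = [mu_star - mu for mu in means if mu < mu_star]
    forcing_constant = sum(gaps)   # S_Delta, Eq. (17)
    ism_constant = len(gaps)       # K - 1 when the optimum is unique
    return forcing_constant, ism_constant
```

For means (1.0, 0.9, 0.8) the g-Forcing constant is 0.3 against 2 for g-ISM, while for means (1.0, 0.0, -2.0) it is 4 against 2. In general g-Forcing has the smaller constant exactly when the average discrepancy of the sub-optimal bandits is below 1.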


5 Relaxing Assumptions: i.i.d. Bandits 

The assumption that the results from each bandit are i.i.d. is fairly standard; the problem is
generally phrased as a matter of knowledge discovery about a set of unknown distributions, through the
use of repeated measurements. However, it is interesting to observe that this assumption actually plays
no part in the results and proofs presented in this paper. The sole distributional property that
mattered for establishing the policies as g-good was the assumption that for each bandit there existed
some finite $\mu_i$ such that $\bar{X}^i_k \to \mu_i$ almost surely with $k$ (though the Law of the
Iterated Logarithm was utilized to great effect in bounding the remainder terms). In fact, the expected
values of the individual $X^i_k$ need not be $\mu_i$, nor must the $X^i_k$ be independent of each other
for a given $i$. Further, it is never necessary that the bandits themselves be independent of each
other! In that regard, the results herein are actually quite general statements about minimizing
pseudo-regret under arbitrary multidimensional stochastic processes that satisfy that strong-law-type
requirement.

However, a word of caution is due: removing the restrictions on $\{X^i_k\}_{k \geq 1}$ in this way,
while not influencing the proofs of the results presented here, does somewhat call into question the
definition of “pseudo-regret” as given in Eq. (3). With the individual sample means freed, it is not
necessarily reasonable to define a finite horizon pseudo-regret, $\tilde{R}_\pi(n)$, in terms of the
infinite horizon means, $\{\mu_i\}$. For instance, it is no longer necessarily true that the optimal,
complete knowledge policy on any finite horizon is simply to activate a bandit with infinite horizon
mean $\mu^*$ at every point. A more applicable definition of pseudo-regret would have to take into
account what is reasonable to know or measure about the state of each bandit in finite time.


Acknowledgement: We would like to acknowledge support for this project from the National Science
Foundation (NSF grant CMMI-14-50743).
A Proof of Proposition 1

Proof. To prove Proposition 1 it will suffice to show the following: For all $i : \mu_i \neq \mu^*$ and
all $\delta > 0$, $\exists$ (a.s.) a finite time $T_\delta < \infty$ such that,

$$ g(t) - 2\delta \leq T^i_{\pi^F_g}(t) \leq \lceil g(t) \rceil, \quad \forall t \geq T_\delta. \tag{36} $$

Proposition 1 follows from this result and Eq. (3), with the appropriate choice of $\delta$.

Without loss of generality, we may restrict ourselves to $\delta < 1/2$.

As a preliminary step: Based on the properties of $g$, if $K$ is the total number of bandits, there
exists a finite, non-random time $t_\delta$ such that the following is true:

$$ g(t + K) < g(t) + \delta, \quad \forall t \geq t_\delta. \tag{37} $$

This follows from the observation that $g(t + K) \leq g(t) + g'(t) K$ (by concavity), and that
$g'(t) \to 0$.

When implementing a $g$-Forcing policy $\pi^F_g$ (hereafter referenced simply as $\pi$), there are essentially
two alternating phases (or modes) of the policy: “catch up” and “play the winner”. During “catch
up”, some number of bandits have fewer than $g$ activations (the sub-$g$ bandits), and they are activated
until all bandits have at least $g$ activations. During “play the winner”, each bandit has at least $g$
activations, and the bandit with the current greatest sample mean is activated. These phases can be
seen as governed by the function $\Delta(t) = g(t) - \min_i T^i_\pi(t)$, so that when $\Delta(t) > 0$, the policy is in
“catch up” mode, and when $\Delta(t) \le 0$, the policy is in “play the winner” mode.
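
The two phases described above can be sketched in code as follows. This is only a minimal illustration under stated assumptions: each bandit is activated once initially, a bandit with the fewest activations is chosen during “catch up”, ties are broken by lowest index, and the Gaussian rewards, the means, and the function name `g_forcing` are hypothetical choices for the example.

```python
import math
import random

def g_forcing(sample, K, g, horizon):
    """Run a g-Forcing-style policy; return (activation counts, sample means)."""
    counts = [0] * K
    sums = [0.0] * K

    def pull(i):
        counts[i] += 1
        sums[i] += sample(i)

    for i in range(K):                 # assumed initialization: one activation each
        pull(i)
    for t in range(K, horizon):        # choose activation t + 1 from the state at time t
        if min(counts) < g(t):         # spread g(t) - min_i T_i(t) > 0: "catch up"
            i = min(range(K), key=lambda j: counts[j])
        else:                          # spread <= 0: "play the winner"
            i = max(range(K), key=lambda j: sums[j] / counts[j])
        pull(i)
    return counts, [s / c for s, c in zip(sums, counts)]

rng = random.Random(0)
means = [0.0, 0.5, 1.0]                # hypothetical bandit means
counts, xbar = g_forcing(lambda i: rng.gauss(means[i], 1.0), 3, math.sqrt, 2000)
print(counts, [round(x, 3) for x in xbar])
```

In this sketch the forcing function only enters through the comparison of the minimum activation count with $g(t)$, which is what ties the exploration rate to $g$.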

Having activated bandits according to policy $\pi$ up to time $t_\delta$, suppose that $\Delta(t_\delta) > 0$, hence the
policy enters or is in a period of “catch up”. Let $d$ ($= d(t_\delta)$) be the number of sub-$g$ bandits at time
$t_\delta$. Because $g$ is increasing, and there are $d$ sub-$g$ bandits at time $t_\delta$, it will take at least $d$ “catch
up” activations before the policy enters a period of “play the winner” ($\Delta \le 0$). Consider activating
bandits according to policy $\pi$ for $d$ activations. Note, $d \le K$, so from Ineq. (37) and the increasing
property of $g$ we have: $g(t_\delta + d) < g(t_\delta) + \delta$. Additionally,
$\min_i T^i_\pi(t_\delta + d) \ge \min_i T^i_\pi(t_\delta) + 1$, as
every bandit realizing the minimum activations will have been activated at least once. It follows that

$$
\begin{aligned}
\Delta(t_\delta + d) &= g(t_\delta + d) - \min_i T^i_\pi(t_\delta + d) \\
&< g(t_\delta) + \delta - \min_i T^i_\pi(t_\delta) - 1 \\
&= \Delta(t_\delta) - (1 - \delta).
\end{aligned}
\tag{38}
$$


Hence, after a period of $d$ activations from time $t_\delta$, the spread $\Delta$ has decreased by at least $1 - \delta$.
Repeating this argument, based on the number of sub-$g$ bandits (if any) at time $t_\delta + d$, it is clear that
eventually - in finite time - a time $T_\Delta < \infty$ is reached such that $\Delta(T_\Delta) \le 0$. At this point, all bandits
have been activated at least $g$ times, and the policy enters a period of “play the winner”. We observe
the loose, but sample-path-wise, bound that,

$$ T_\Delta \le t_\delta + K \frac{(\Delta(t_\delta))^+}{1 - \delta} \le t_\delta + K \frac{g(t_\delta)}{1 - \delta} < \infty, \tag{39} $$

since $\Delta(t) \le g(t)$ always, and at every step the number of sub-$g$ bandits is at most $K$. Observe that
if in fact $\Delta(t_\delta) \le 0$, then we may take $T_\Delta = t_\delta$.

Having entered a period of $\Delta \le 0$ or “play the winner” at time $T_\Delta$, let $t \ge T_\Delta$ be such that
$\Delta(t) \le 0$ but $\Delta(t + 1) > 0$. That is, in the transition from time $t$ to $t + 1$, $g$ surpasses the number of
activations of some bandits and the policy enters a period of “catch up”. At such a point, we have the
following relations:

$$ \min_i T^i_\pi(t + 1) < g(t + 1) < g(t) + \delta \le \min_i T^i_\pi(t) + \delta. \tag{40} $$

The first inequality is simply that $\Delta(t + 1) > 0$, the second following since $t \ge t_\delta$, and the last since
$\Delta(t) \le 0$. However, since the $T^i_\pi$ are integer valued and non-decreasing, the above yields

$$ \min_i T^i_\pi(t + 1) = \min_i T^i_\pi(t). \tag{41} $$

Combining Eqns. (40), (41) yields the important relation that $\Delta(t + 1) < \delta$. Note additionally,

$$ g(t + 1) < g(t) + \delta \le \min_i T^i_\pi(t) + \delta < \min_i T^i_\pi(t + 1) + 1. \tag{42} $$
Again noting the $T^i_\pi$ are integer valued, this implies that while there are sub-$g$ bandits at time $t + 1$,
the only sub-$g$ bandits are those that realize the minimum number of activations $\min_i T^i_\pi(t + 1)$. All
other bandits have activations strictly greater than $g$. Let the number of sub-$g$ bandits at time $t + 1$
again be denoted $d = d(t + 1)$. For $d' < d$ ($\le K$) additional activations under $\pi$, in the “catch up”
phase, we have that $\min_i T^i_\pi(t + 1 + d') = \min_i T^i_\pi(t + 1)$ and $g(t + 1 + d') < g(t + 1) + \delta$. Hence,
$\Delta(t + 1 + d') < \Delta(t + 1) + \delta < 2\delta$. For $d$ additional activations after time $t + 1$, each sub-$g$ bandit
has been activated once, raising the minimum number of activations by 1:
$\min_i T^i_\pi(t + 1 + d) = \min_i T^i_\pi(t + 1) + 1$.
Additionally, $g(t + 1 + d) < g(t + 1) + \delta$, hence $\Delta(t + 1 + d) < \Delta(t + 1) - \delta < 0$.

We see therefore that after $T_\Delta$, at any point at which $\Delta$ becomes positive after being at most zero, it
is at most $2\delta$ for a finite time - the “catch up” phase - before becoming negative. Hence it follows
that for $t \ge T_\Delta$, $\Delta(t) \le 2\delta$, or for each $i$

$$ g(t) - 2\delta \le T^i_\pi(t). \tag{43} $$

Note, this is true for all $i$. This acts as justification for the description of $g$ as the “forcing function”,
as the policy forces all activations to grow at least at $g$ asymptotically.
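
The bound on the spread can also be checked numerically. The sketch below is an illustration under assumptions that are not in the text: $g(t) = \sqrt{t}$, $K = 3$, $\delta = 0.25$, and a “winner” fixed to arm 0. Fixing the winner is a deliberate simplification, since the argument above is sample-path-wise and never uses the rewards. The names `spread_path`, `t_delta` and `T_Delta` are hypothetical.

```python
import math

def spread_path(g, K, horizon):
    """Return {t: g(t) - min_i T_i(t)} for t = K..horizon under forced catch-up.

    Assumption: during "play the winner" arm 0 is always activated; the
    spread bound does not depend on which arm wins.
    """
    counts = [1] * K                        # one initial activation per bandit
    path = {K: g(K) - min(counts)}
    for t in range(K, horizon):
        if min(counts) < g(t):              # "catch up": activate a minimal arm
            i = counts.index(min(counts))
        else:                               # "play the winner" (fixed to arm 0)
            i = 0
        counts[i] += 1
        path[t + 1] = g(t + 1) - min(counts)
    return path

K, delta, horizon = 3, 0.25, 5000
g = math.sqrt
path = spread_path(g, K, horizon)
# first t with g(t + K) - g(t) < delta; valid for all later t since g is concave
t_delta = next(t for t in range(K, horizon) if g(t + K) - g(t) < delta)
# first time at or after t_delta at which the spread is non-positive
T_Delta = next(t for t in range(t_delta, horizon) if path[t] <= 0)
worst = max(path[t] for t in range(T_Delta, horizon + 1))
print(t_delta, T_Delta, worst)
```

Under these assumptions the largest spread observed after `T_Delta` stays below $2\delta$, in line with the relation $g(t) - 2\delta \le T^i_\pi(t)$.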

Since $g$ is unbounded and increasing, all populations are sampled infinitely often over time. Taking
the strong law of large numbers to hold, for every $\varepsilon > 0$ and each $i$, there exists almost surely some
finite $N^i_\varepsilon$ such that $\bar X^i_k \in [\mu_i - \varepsilon, \mu_i + \varepsilon]$ for all $k \ge N^i_\varepsilon$. It is worth noting here that while
such an $N^i_\varepsilon$ exists, it is random and unknowable to the investigator. Because of the properties of $g$, we may
define a finite $T^i_\varepsilon \ge T_\Delta$ such that $N^i_\varepsilon \le g(T^i_\varepsilon) - 2\delta$. By Eq. (43), we have that for all $t \ge T^i_\varepsilon$,

$$ \bar X^i_{T^i_\pi(t)} \in [\mu_i - \varepsilon, \mu_i + \varepsilon]. \tag{44} $$

Hence we have for each population, for every $\varepsilon > 0$, there exists almost surely a finite random time
$T_\varepsilon = \max_i T^i_\varepsilon < \infty$ past which the sample mean is trapped within the $\mu_i \pm \varepsilon$ interval.

Fix $\varepsilon$ sufficiently small, so as to distinguish $\mu^*$ from the other means (i.e.,
$[\mu^* - \varepsilon, \mu^* + \varepsilon] \cap [\mu_i - \varepsilon, \mu_i + \varepsilon] = \emptyset$ for all $i : \mu_i \neq \mu^*$).
By the previous observations, we have therefore that for all $t \ge T_\varepsilon$,
for all sub-optimal $i$ and any optimal $i^*$,

$$ \bar X^{i^*}_{T^{i^*}_\pi(t)} > \bar X^i_{T^i_\pi(t)}. \tag{45} $$

In short, almost surely there exists a finite time $T_\varepsilon$ past which the sample means of sub-optimal
bandits are always inferior to the sample mean of any optimal bandit.

By the structure of the policy $\pi$, for all $t \ge T_\varepsilon$, sub-optimal populations are only activated during
the $g$-forced “catch up” periods. If at time $T_\varepsilon$, the number of times a sub-optimal bandit $i$ has been
activated is greater than $g$ - for instance due to it, at some point, having the largest sample mean
during a “play the winner” period - that population will not be sampled again until $g$ has increased
to overcome this temporary excess. As $g$ is increasing and unbounded, this must occur in finite
time. Once this occurs, as observed previously, $g$ can only exceed $T^i_\pi$ by at most $2\delta$ before bandit
$i$ is again activated, raising $T^i_\pi$ above $g$ once more. As this “catch up” is the only time bandit $i$ is
activated, and $\delta < 1/2$, it follows that there exists some finite time $t^i_\varepsilon$ such that for $t \ge t^i_\varepsilon$,
$T^i_\pi(t) \le \lceil g(t) \rceil$. Taking $T_\delta = \max_{i : \mu_i \neq \mu^*} t^i_\varepsilon$, and noting that $T_\Delta \le T_\delta < \infty$, we have that for
$t \ge T_\delta$, for all sub-optimal $i$,

$$ g(t) - 2\delta \le T^i_\pi(t) \le \lceil g(t) \rceil. \tag{46} $$
B Proofs of Propositions 2, 3


In this section, $\pi$ refers to a $g$-ISM index policy as in Eq. (23). The results to follow depend on the
following lemma.

Lemma 1 Under the assumption of Eq. (1), for each $i$, and for any $\varepsilon > 0$, the inequality:

$$ u_i(j,k) < \mu_i - \varepsilon $$

holds for only finitely many $(j,k)$-pairs, almost surely.

Proof. As an application of the strong law, almost surely there is some finite $N^i_\varepsilon$ such that
$\bar X^i_k > \mu_i - \varepsilon/2$, for all $k \ge N^i_\varepsilon$. For such $k$, as $g$ is positive,
$u_i(j,k) = \bar X^i_k + g(j)/k \ge \mu_i - \varepsilon$, for all $j$. For any $k < N^i_\varepsilon$, the relation
$u_i(j,k) = \bar X^i_k + g(j)/k < \mu_i - \varepsilon$ may be true only for finitely many
$j$ since $g$ is increasing and unbounded.
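
To make the index concrete, the following minimal sketch implements $u_i(j,k) = \bar X^i_k + g(j)/k$ and an index policy that activates a bandit with maximal current index. The initialization (one activation per bandit), the tie-breaking rule, the Gaussian rewards, and the names `ism_index` and `g_ism` are assumptions for illustration only.

```python
import math
import random

def ism_index(xbar, j, k, g):
    """Inflated sample mean u(j, k) = xbar + g(j) / k."""
    return xbar + g(j) / k

def g_ism(sample, K, g, horizon):
    """Run a g-ISM-style index policy; return the activation counts."""
    counts = [0] * K
    sums = [0.0] * K
    for i in range(K):                 # assumed initialization: one activation each
        sums[i] += sample(i)
        counts[i] += 1
    for t in range(K, horizon):        # activation t + 1 uses indices at time t
        i = max(range(K),
                key=lambda j: ism_index(sums[j] / counts[j], t, counts[j], g))
        sums[i] += sample(i)
        counts[i] += 1
    return counts

rng = random.Random(1)
means = [0.0, 1.0]                     # hypothetical bandit means
counts = g_ism(lambda i: rng.gauss(means[i], 1.0), 2, math.log, 500)
print(counts)
```

The inflation term $g(j)/k$ grows in the global time $j$ and shrinks in the local sample count $k$, which is why a rarely sampled bandit eventually regains the maximal index.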

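Proof of Proposition 2. For $i \neq i^*$, we define the following quantities. Taking $\varepsilon > 0$, and
$2\varepsilon < \mu^* - \mu_i$, and $n \ge K$,

$$
\begin{aligned}
n^i_1(n, \varepsilon) &= \sum_{t=K}^{n} \mathbb{1}\{\pi(t+1) = i,\ u_i(t, T^i_\pi(t)) \ge \mu^* - \varepsilon,\ \bar X^i_{T^i_\pi(t)} \le \mu_i + \varepsilon\} \\
n^i_2(n, \varepsilon) &= \sum_{t=K}^{n} \mathbb{1}\{\pi(t+1) = i,\ u_i(t, T^i_\pi(t)) \ge \mu^* - \varepsilon,\ \bar X^i_{T^i_\pi(t)} > \mu_i + \varepsilon\} \\
n^i_3(n, \varepsilon) &= \sum_{t=K}^{n} \mathbb{1}\{\pi(t+1) = i,\ u_i(t, T^i_\pi(t)) < \mu^* - \varepsilon\}.
\end{aligned}
\tag{47}
$$

Hence we have the following relationship,

$$ T^i_\pi(n + 1) = 1 + \sum_{t=K}^{n} \mathbb{1}\{\pi(t+1) = i\} = 1 + n^i_1(n, \varepsilon) + n^i_2(n, \varepsilon) + n^i_3(n, \varepsilon). \tag{48} $$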


The proof proceeds via a pointwise bound on each of the three terms. For the first term,

$$
\begin{aligned}
n^i_1(n, \varepsilon) &\le \sum_{t=K}^{n} \mathbb{1}\{\pi(t+1) = i,\ \mu_i + \varepsilon + g(t)/T^i_\pi(t) \ge \mu^* - \varepsilon\} \\
&= \sum_{t=K}^{n} \mathbb{1}\{\pi(t+1) = i,\ g(t)/((\mu^* - \mu_i) - 2\varepsilon) \ge T^i_\pi(t)\} \\
&\le \sum_{t=K}^{n} \mathbb{1}\{\pi(t+1) = i,\ g(n)/((\mu^* - \mu_i) - 2\varepsilon) \ge T^i_\pi(t)\} \\
&\le \frac{g(n)}{(\mu^* - \mu_i) - 2\varepsilon} + 1.
\end{aligned}
\tag{49}
$$

The last inequality comes from viewing $T^i_\pi(t)$ as a sum of $\mathbb{1}\{\pi(t+1) = i\}$ indicators, and seeing
that the condition on it bounds the number of non-zero terms in this sum.
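
The counting step behind the last inequality can be illustrated with a small sketch. Here `seq` is a hypothetical activation sequence and `count_low_activations` a hypothetical helper; neither comes from the text. Each activation of arm $i$ raises its count by one, so the count takes each value in $\{1, \ldots, \lfloor c \rfloor\}$ at most once at an activation time, and hence at most $c$ activations can occur while the count is at most $c$.

```python
def count_low_activations(seq, i, c):
    """Count activations of arm i made while its prior count lies in [1, c].

    The very first activation (prior count 0) plays the role of the
    initialization round and is not counted.
    """
    count_i, hits = 0, 0
    for arm in seq:
        if arm == i:
            if 1 <= count_i <= c:
                hits += 1
            count_i += 1
    return hits

seq = [0, 1, 2, 1, 1, 0, 1, 2, 1, 1]   # hypothetical activation sequence
hits = count_low_activations(seq, i=1, c=2.7)
print(hits)
```

Whatever the sequence, the returned count never exceeds the threshold `c`, which is the content of the bound in Eq. (49) with $c = g(n)/((\mu^* - \mu_i) - 2\varepsilon)$.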

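For the second term,

$$
\begin{aligned}
n^i_2(n, \varepsilon) &\le \sum_{t=K}^{n} \mathbb{1}\{\pi(t+1) = i,\ \bar X^i_{T^i_\pi(t)} > \mu_i + \varepsilon\} \\
&= \sum_{t=K}^{n} \sum_{k=1}^{n} \mathbb{1}\{\pi(t+1) = i,\ \bar X^i_k > \mu_i + \varepsilon,\ T^i_\pi(t) = k\} \\
&= \sum_{t=K}^{n} \sum_{k=1}^{n} \mathbb{1}\{\pi(t+1) = i,\ T^i_\pi(t) = k\}\, \mathbb{1}\{\bar X^i_k > \mu_i + \varepsilon\} \\
&= \sum_{k=1}^{n} \mathbb{1}\{\bar X^i_k > \mu_i + \varepsilon\} \sum_{t=K}^{n} \mathbb{1}\{\pi(t+1) = i,\ T^i_\pi(t) = k\} \\
&\le \sum_{k=1}^{n} \mathbb{1}\{\bar X^i_k > \mu_i + \varepsilon\}.
\end{aligned}
\tag{50}
$$

The last inequality holds as, for a given $k$, $\{\pi(t+1) = i,\ T^i_\pi(t) = k\}$ may be true for only one $t$.
Taking it one step further, we have

$$ n^i_2(n, \varepsilon) \le \sum_{k=1}^{\infty} \mathbb{1}\{\bar X^i_k > \mu_i + \varepsilon\}, \tag{51} $$

and since the strong law of large numbers is taken to hold, we have therefore that $n^i_2(n, \varepsilon)$ is almost
surely bounded by a finite constant, for all $n \ge K$.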

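For the third term, note that from the structure of the policy, a population is only sampled if it has
the maximal current index. Hence, if $\pi(t+1) = i$, it must be true that
$u_{i^*}(t, T^{i^*}_\pi(t)) \le u_i(t, T^i_\pi(t))$.

Hence we have the bound,

$$
\begin{aligned}
n^i_3(n, \varepsilon) &\le \sum_{t=K}^{n} \mathbb{1}\{\pi(t+1) = i,\ u_{i^*}(t, T^{i^*}_\pi(t)) < \mu^* - \varepsilon\} \\
&\le \sum_{t=K}^{n} \mathbb{1}\{u_{i^*}(t, T^{i^*}_\pi(t)) < \mu^* - \varepsilon\} \\
&\le \sum_{t=K}^{\infty} \mathbb{1}\{u_{i^*}(t, T^{i^*}_\pi(t)) < \mu^* - \varepsilon\}.
\end{aligned}
\tag{52}
$$

From the prior observation about the form of the index, Lemma 1, we have that
$u_{i^*}(t, T^{i^*}_\pi(t)) < \mu^* - \varepsilon$ is true for only finitely many $t$, almost surely. Hence, from the above bound,
$n^i_3(n, \varepsilon)$ is almost surely bounded by a finite constant, for all $n \ge K$.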

Combining the above results bounding $n^i_1, n^i_2, n^i_3$ with Eq. (48), and observing too that
$T^i_\pi(n) \le T^i_\pi(n + 1)$, we have that almost surely there exists some finite $C^i_\varepsilon$ such that for all $n \ge K$,

$$ T^i_\pi(n) \le \frac{g(n)}{(\mu^* - \mu_i) - 2\varepsilon} + C^i_\varepsilon. \tag{53} $$





Proof of Proposition 3. Define a constant $P_\Delta = \sum_{i \neq i^*} 1/(\mu^* - \mu_i)$. Taking
$\varepsilon < \min_{j \neq i^*} (\mu^* - \mu_j)/2$, we may apply Prop. 2 to yield for each $i \neq i^*$, $\exists$ (a.s.) a finite $N^i_\varepsilon$ such that
$T^i_\pi(n) \le (1 + \varepsilon) g(n)/(\mu^* - \mu_i)$ for all $n \ge N^i_\varepsilon$. Taking $N_\varepsilon = \max_{i \neq i^*} N^i_\varepsilon$, summing over these
relations and taking $n \ge N_\varepsilon$,

$$ \sum_{i \neq i^*} T^i_\pi(n) \le (1 + \varepsilon) g(n) P_\Delta. \tag{54} $$

The sum above equals the number of activations of sub-optimal bandits up to and including time $n$.
As the total number of bandit activations up to time $n$ is $n$, we have from the above that
$T^{i^*}_\pi(n) \ge n - O(g(n))$.

Trivially from this, the optimal bandit $i^*$ is activated infinitely often, approaching full density of
activations as $n$ increases.

Given this linear lower bound on $T^{i^*}_\pi$, it follows that $u_{i^*}(n, T^{i^*}_\pi(n))$ converges to $\mu^*$, almost surely.
Hence, almost surely there exists a finite $N'_\varepsilon$ such that for $n \ge N'_\varepsilon$,
$u_{i^*}(n, T^{i^*}_\pi(n)) \le \mu^* + \varepsilon$. As
under this policy a bandit is only activated when it has the maximal index, it follows that infinitely
often (on the activations of $i^*$), the indices of all sub-optimal bandits are at most $\mu^* + \varepsilon$. Given the
structure of the indices, it follows that these sub-optimal bandits must be activated infinitely often
as well. Hence, almost surely, $T^i_\pi(n)$ increases without bound, for all $i$. Applying the strong law
here, since there are finitely many bandits being considered, $\exists$ (a.s.) a finite “$\varepsilon$-trapping time”,
$N^{\mathrm{trap}}_\varepsilon$, such that

$$ \bar X^i_{T^i_\pi(n)} \in [\mu_i - \varepsilon, \mu_i + \varepsilon], \quad \forall n \ge N^{\mathrm{trap}}_\varepsilon \text{ and } \forall i. $$

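Let $\{n_k\}_{k \ge 0}$ be the infinite sequence of times at which bandit $i^*$ has the current optimal index (and
hence is activated next). For a given $i \neq i^*$, we have that for all sufficiently large $k$ ($n_k \ge N^{\mathrm{trap}}_\varepsilon$),

$$
\begin{aligned}
\max_{n_k \le n \le n_{k+1}} u_i(n, T^i_\pi(n)) &\le (\mu_i + \varepsilon) + \frac{g(n_{k+1})}{T^i_\pi(n_k)} \\
&= (\mu_i + \varepsilon) + \frac{g(n_{k+1})}{g(n_k)} \frac{g(n_k)}{T^i_\pi(n_k)} \\
&= (\mu_i + \varepsilon) + \frac{g(n_{k+1})}{g(n_k)} \left( u_i(n_k, T^i_\pi(n_k)) - \bar X^i_{T^i_\pi(n_k)} \right) \\
&\le (\mu_i + \varepsilon) + \frac{g(n_{k+1})}{g(n_k)} \left( u_i(n_k, T^i_\pi(n_k)) - (\mu_i - \varepsilon) \right).
\end{aligned}
\tag{55}
$$

Additionally, however, at time $n_k$ bandit $i^*$ has the largest index. For sufficiently large $k$
($n_k \ge N'_\varepsilon$), this index must be at most $\mu^* + \varepsilon$. Hence for $n_k \ge \max(N'_\varepsilon, N^{\mathrm{trap}}_\varepsilon)$, for $i \neq i^*$ we have that
$u_i(n_k, T^i_\pi(n_k)) \le u_{i^*}(n_k, T^{i^*}_\pi(n_k)) \le \mu^* + \varepsilon$, and

$$
\begin{aligned}
\max_{n_k \le n \le n_{k+1}} u_i(n, T^i_\pi(n)) &\le (\mu_i + \varepsilon) + \frac{g(n_{k+1})}{g(n_k)} \left( (\mu^* + \varepsilon) - (\mu_i - \varepsilon) \right) \\
&= (\mu_i + \varepsilon) + \frac{g(n_{k+1})}{g(n_k)} (\mu^* - \mu_i + 2\varepsilon).
\end{aligned}
\tag{56}
$$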
Since we took $g$ to be concave, $g(n_{k+1}) \le g(n_k) + (n_{k+1} - n_k) g'(n_k)$. The difference $n_{k+1} - n_k - 1$
is the number of sub-optimal bandit activations between the $k$-th and $(k+1)$-th activations of bandit $i^*$.
This is bounded from above by the total number of sub-optimal activations prior to time $n_{k+1}$, which
by Eq. (54) is at most $(1 + \varepsilon) g(n_{k+1}) P_\Delta$ for all $n_{k+1} \ge N_\varepsilon$. Hence,

$$ g(n_{k+1}) \le g(n_k) + \left( (1 + \varepsilon) g(n_{k+1}) P_\Delta + 1 \right) g'(n_k). \tag{57} $$

As $g' \to 0$, for all sufficiently large $k$, we have that $(1 + \varepsilon) P_\Delta g'(n_k) < 1$ and

$$ \frac{g(n_{k+1})}{g(n_k)} \le \frac{1 + \frac{g'(n_k)}{g(n_k)}}{1 - (1 + \varepsilon) P_\Delta g'(n_k)}. \tag{58} $$


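As $g$ is taken to be increasing, and $g'$ is taken to limit to 0, we have from the above that there is
some finite $N^g_\varepsilon$ such that for all sufficiently large $k$ ($n_k \ge N^g_\varepsilon$), $g(n_{k+1})/g(n_k) \le 1 + \varepsilon$. Hence, for
$n_k \ge \max(N_\varepsilon, N'_\varepsilon, N^{\mathrm{trap}}_\varepsilon, N^g_\varepsilon)$,

$$ \max_{n_k \le n \le n_{k+1}} u_i(n, T^i_\pi(n)) \le (\mu_i + \varepsilon) + (1 + \varepsilon)(\mu^* - \mu_i + 2\varepsilon). \tag{59} $$

Let $N^f_\varepsilon = \min\{n_k : n_k \ge \max(N_\varepsilon, N'_\varepsilon, N^{\mathrm{trap}}_\varepsilon, N^g_\varepsilon)\} < \infty$. As the upper bound above no longer depends
on $k$, we have that for $n \ge N^f_\varepsilon$,

$$ u_i(n, T^i_\pi(n)) \le (\mu_i + \varepsilon) + (1 + \varepsilon)(\mu^* - \mu_i + 2\varepsilon). \tag{60} $$

Observing that $\bar X^i_{T^i_\pi(n)} \ge \mu_i - \varepsilon$, the above yields
$\mu_i - \varepsilon + g(n)/T^i_\pi(n) \le (\mu_i + \varepsilon) + (1 + \varepsilon)(\mu^* - \mu_i + 2\varepsilon)$, or

$$ \frac{g(n)}{(1 + \varepsilon)(\mu^* - \mu_i + 2\varepsilon) + 2\varepsilon} \le T^i_\pi(n). \tag{61} $$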
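
As a numerical illustration of how the upper bound of Eq. (53) and the lower bound of Eq. (61) squeeze $T^i_\pi(n)/g(n)$, the sketch below evaluates the two coefficients of $g(n)$ for shrinking $\varepsilon$. The gap value and the helper name `ism_count_coefficients` are hypothetical, and the additive constant of Eq. (53) is ignored since it is of lower order.

```python
def ism_count_coefficients(gap, eps):
    """Coefficients of g(n): (lower, upper), with gap = mu* - mu_i and 2*eps < gap."""
    if not 0 < 2 * eps < gap:
        raise ValueError("need 0 < 2*eps < gap")
    lower = 1.0 / ((1 + eps) * (gap + 2 * eps) + 2 * eps)   # from the lower bound
    upper = 1.0 / (gap - 2 * eps)                           # from the upper bound
    return lower, upper

gap = 0.5                                # hypothetical value of mu* - mu_i
rows = [(eps,) + ism_count_coefficients(gap, eps) for eps in (0.1, 0.01, 0.001)]
for eps, lower, upper in rows:
    print(eps, lower, 1 / gap, upper)
```

Both coefficients approach $1/(\mu^* - \mu_i)$ as $\varepsilon \to 0$, which is the sense in which sub-optimal activations grow like $g(n)/(\mu^* - \mu_i)$.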


C Proofs of Propositions 4, 5


We present the following preliminary bounds to aid in the proofs of Props. 4, 5. In this section, $\pi$ is
taken to be a $g$-ISM index policy as in Eq. (23). Additionally, it is convenient to define

$$ P_\Delta = \sum_{i \neq i^*} \frac{1}{\mu^* - \mu_i}. \tag{62} $$

It follows from Props. 2, 3 that for any $\varepsilon > 0$, $\exists$ (a.s.) some finite $N_\varepsilon$ such that for $n \ge N_\varepsilon$, the
following holds: for $i \neq i^*$,

$$ \frac{1 - \varepsilon}{\mu^* - \mu_i} g(n) \le T^i_\pi(n) \le \frac{1 + \varepsilon}{\mu^* - \mu_i} g(n). \tag{63} $$
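And similarly, for the optimal bandit,

$$ n - (1 + \varepsilon) P_\Delta g(n) \le T^{i^*}_\pi(n) \le n - (1 - \varepsilon) P_\Delta g(n). \tag{64} $$

To simplify the case for the optimal bandit slightly, it also holds that for all sufficiently large $n$,
$T^{i^*}_\pi(n) \ge n/2$. We'll also observe here, as an aside, that for some finite $\tilde N_\varepsilon$,

$$ \frac{1 - \varepsilon}{\mu^* - \mu_i} g(n) \ge 6, \quad \text{for all } n \ge \tilde N_\varepsilon, \text{ and } i \neq i^*. $$

As each bandit is activated infinitely often, $T^i_\pi$ increases without bound with $n$, and hence we may
apply the Law of the Iterated Logarithm in the following way: There exists a finite time $N'_\varepsilon$ such
that for $n \ge N'_\varepsilon$, for each bandit $i$,

$$ \left| \bar X^i_{T^i_\pi(n)} - \mu_i \right| \le \sigma_i \sqrt{2} (1 + \varepsilon) \sqrt{\frac{\ln\ln T^i_\pi(n)}{T^i_\pi(n)}}. \tag{65} $$

However, since $\sqrt{\ln\ln x / x}$ is decreasing for all $x \ge 6$, we may apply the above bounds to have that,
for $n \ge \max(N_\varepsilon, N'_\varepsilon, \tilde N_\varepsilon, 12)$, for $i \neq i^*$,

$$ \left| \bar X^i_{T^i_\pi(n)} - \mu_i \right| \le \sigma_i \sqrt{2} (1 + \varepsilon) \sqrt{\frac{\ln\ln\left( \frac{1 - \varepsilon}{\mu^* - \mu_i} g(n) \right)}{\frac{1 - \varepsilon}{\mu^* - \mu_i} g(n)}}, \tag{66} $$

and for the optimal bandit,

$$ \left| \bar X^{i^*}_{T^{i^*}_\pi(n)} - \mu^* \right| \le \sigma_{i^*} \sqrt{2} (1 + \varepsilon) \sqrt{\frac{\ln\ln(n/2)}{n/2}}. \tag{67} $$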
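
The monotonicity claim used above, that $\sqrt{\ln\ln x / x}$ is decreasing for $x \ge 6$, can be checked numerically with a short sketch. The grid, its range, and the function name `lil_rate` are arbitrary choices for illustration.

```python
import math

def lil_rate(x):
    """sqrt(ln ln x / x), the rate appearing in the iterated logarithm bounds."""
    return math.sqrt(math.log(math.log(x)) / x)

xs = [6 + 0.5 * j for j in range(2000)]          # grid on [6, 1005.5]
decreasing = all(lil_rate(a) > lil_rate(b) for a, b in zip(xs, xs[1:]))
print(decreasing, lil_rate(4), lil_rate(6))
```

The function is not monotone on all of its domain (its value at 4 is below its value at 6), which is why the threshold on $x$, and hence on $g(n)$ and $n$, is needed before the substitution of the bounds (63) and (64) into (65).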


Proof of Proposition 4. Let $1 > \varepsilon > 0$. For $i \neq i^*$, let

= ( 68 ) 

Observe that $h_i \to 0$ from above as $t \to \infty$. Note that there exists a $T_\varepsilon < \infty$ such that for $t \ge T_\varepsilon$,
$g(t)/(\mu^* - \mu_i - 2h_i(t))$ is increasing. The proof proceeds analogously to the proof of Prop. 2,
utilizing the improved iterated logarithm bounds above.

For $n \ge T_\varepsilon$, define the following functions:

$$
\begin{aligned}
n^i_1(n) &= \sum_{t=T_\varepsilon}^{n} \mathbb{1}\{\pi(t+1) = i,\ u_i(t, T^i_\pi(t)) \ge \mu^* - h_i(t),\ \bar X^i_{T^i_\pi(t)} \le \mu_i + h_i(t)\} \\
n^i_2(n) &= \sum_{t=T_\varepsilon}^{n} \mathbb{1}\{\pi(t+1) = i,\ u_i(t, T^i_\pi(t)) \ge \mu^* - h_i(t),\ \bar X^i_{T^i_\pi(t)} > \mu_i + h_i(t)\} \\
n^i_3(n) &= \sum_{t=T_\varepsilon}^{n} \mathbb{1}\{\pi(t+1) = i,\ u_i(t, T^i_\pi(t)) < \mu^* - h_i(t)\}.
\end{aligned}
\tag{69}
$$

Hence, we have the following relationship, that for $n \ge T_\varepsilon$,

$$ T^i_\pi(n) \le T_\varepsilon + 1 + n^i_1(n) + n^i_2(n) + n^i_3(n). \tag{70} $$


The proof proceeds as in the proof of Prop. 2, bounding each of the three terms. For the first,

$$
\begin{aligned}
n^i_1(n) &\le \sum_{t=T_\varepsilon}^{n} \mathbb{1}\{\pi(t+1) = i,\ \mu_i + h_i(t) + g(t)/T^i_\pi(t) \ge \mu^* - h_i(t)\} \\
&= \sum_{t=T_\varepsilon}^{n} \mathbb{1}\{\pi(t+1) = i,\ g(t)/((\mu^* - \mu_i) - 2h_i(t)) \ge T^i_\pi(t)\} \\
&\le \sum_{t=T_\varepsilon}^{n} \mathbb{1}\{\pi(t+1) = i,\ g(n)/((\mu^* - \mu_i) - 2h_i(n)) \ge T^i_\pi(t)\} \\
&\le \frac{g(n)}{(\mu^* - \mu_i) - 2h_i(n)} + 1.
\end{aligned}
\tag{71}
$$

As before, the last inequality comes from viewing $T^i_\pi(t)$ as a sum of $\mathbb{1}\{\pi(t+1) = i\}$ indicators, and
seeing that the condition on it bounds the number of non-zero terms in this sum. It is also important
to observe here that we are explicitly in a regime in which $g(t)/((\mu^* - \mu_i) - 2h_i(t))$ is an increasing
function of $t$.

For the second term,

$$ n^i_2(n) \le \sum_{t=T_\varepsilon}^{n} \mathbb{1}\{\pi(t+1) = i,\ \bar X^i_{T^i_\pi(t)} > \mu_i + h_i(t)\}. \tag{72} $$



The last inequality holds by the iterated logarithm bound in Eq. (66). Taking it one step further, we
have



(73) 


Note that as 



(74) 


the event indicated in the above sum bounding $n^i_2(n)$ may occur only finitely many times, almost
surely. Hence, $n^i_2(n)$ is almost surely bounded by a finite constant, for all $n \ge T_\varepsilon$.


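For the third term, as before, by the structure of the policy, a population is only sampled if it has
the maximal current index. Hence, if $\pi(t+1) = i$, it must be true that
$u_{i^*}(t, T^{i^*}_\pi(t)) \le u_i(t, T^i_\pi(t))$. It follows that

$$
\begin{aligned}
n^i_3(n) &\le \sum_{t=T_\varepsilon}^{n} \mathbb{1}\{\pi(t+1) = i,\ u_{i^*}(t, T^{i^*}_\pi(t)) < \mu^* - h_i(t)\} \\
&\le \sum_{t=T_\varepsilon}^{n} \mathbb{1}\{u_{i^*}(t, T^{i^*}_\pi(t)) < \mu^* - h_i(t)\} \\
&\le \sum_{t=T_\varepsilon}^{n} \mathbb{1}\left\{ -\sigma_{i^*} \sqrt{2} (1 + \varepsilon) \sqrt{\frac{\ln\ln(t/2)}{t/2}} + \frac{g(t)}{t} < -h_i(t) \right\},
\end{aligned}
\tag{75}
$$

the last relation coming from the iterated logarithm bound for the optimal bandit, Eq. (67). As a
final simplification,

$$ n^i_3(n) \le \sum_{t=T_\varepsilon}^{n} \mathbb{1}\left\{ -\sigma_{i^*} \sqrt{2} (1 + \varepsilon) \sqrt{\frac{\ln\ln(t/2)}{t/2}} < -h_i(t) \right\}. \tag{76} $$

If $g(n) = o(n/\ln\ln n)$, it is easy to verify that the indicated event in the above sum can only occur
for finitely many $t$. Hence, by the above, there is a finite constant bounding $n^i_3(n)$ for all $n \ge T_\varepsilon$.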

Combining the above results, there is a finite constant $D^i_\varepsilon$ such that for all $n \ge T_\varepsilon$,

$$ T^i_\pi(n) \le \frac{g(n)}{(\mu^* - \mu_i) - 2h_i(n)} + D^i_\varepsilon. \tag{77} $$


We have from this that

$$ (\mu^* - \mu_i) T^i_\pi(n) - g(n) \le g(n) \frac{2h_i(n)}{(\mu^* - \mu_i) - 2h_i(n)} + (\mu^* - \mu_i) D^i_\varepsilon. $$

For a fixed $\varepsilon > 0$, the above yields (taking the limit, given the choice of $h_i(n)$),
limcup A ii)Tj(n)-g(n) ^ 2q,y/2(l+ £ ) 2 . 

n ^(^lnln^n) ^ l-£ 


(78) 


(79) 


As the above holds for all $\varepsilon > 0$, this yields, almost surely,


lim sup 

n 


(M*~A li)Tj(n) -g(n) 
y/g(n)lnlng(n) 


2<7,-v / 2 


( 80 ) 


Proof of Proposition 5. Let $\varepsilon \in (0, 1)$. Recall from the proof of Prop. 3 the infinite sequence
$\{n_k\}_{k \ge 0}$ of times at which the index of the optimal bandit $i^*$ is maximal. For notational convenience,
we will write $u_i(n) = u_i(n, T^i_\pi(n))$, and for $i \neq i^*$, we define

$$ U^i_k = \max_{n_k \le n \le n_{k+1}} u_i(n), \tag{81} $$

and

$$ M^i_k = \max_{n_k \le n \le n_{k+1}} \bar X^i_{T^i_\pi(n)}. \tag{82} $$


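We have the following relations,

$$
\begin{aligned}
U^i_k &\le \max_{n_k \le n \le n_{k+1}} \bar X^i_{T^i_\pi(n)} + \frac{g(n_{k+1})}{T^i_\pi(n_k)} \\
&= M^i_k + \frac{g(n_{k+1})}{g(n_k)} \frac{g(n_k)}{T^i_\pi(n_k)} \\
&= M^i_k + \frac{g(n_{k+1})}{g(n_k)} \left( u_i(n_k) - \bar X^i_{T^i_\pi(n_k)} \right).
\end{aligned}
\tag{83}
$$

For $n$ such that $n_k \le n \le n_{k+1}$, trivially $u_i(n) \le U^i_k$. It follows that

$$ \frac{g(n)}{T^i_\pi(n)} \le \left( M^i_k - \bar X^i_{T^i_\pi(n)} \right) + \frac{g(n_{k+1})}{g(n_k)} \left( u_i(n_k) - \bar X^i_{T^i_\pi(n_k)} \right). \tag{84} $$

Defining the following terms for space,

$$
\begin{aligned}
A_{n,k} &= M^i_k - \bar X^i_{T^i_\pi(n)}, \\
B_k &= \frac{g(n_{k+1})}{g(n_k)} u_{i^*}(n_k) - \mu^*, \\
C_k &= \frac{g(n_{k+1})}{g(n_k)} \bar X^i_{T^i_\pi(n_k)} - \mu_i, \\
\Delta(n) &= g(n) - (\mu^* - \mu_i) T^i_\pi(n),
\end{aligned}
\tag{85}
$$

the above relation may be rearranged (using $u_i(n_k) \le u_{i^*}(n_k)$, as $i^*$ has the maximal index at time
$n_k$) to yield

$$ \Delta(n)/T^i_\pi(n) \le A_{n,k} + B_k - C_k. \tag{86} $$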


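We may apply the iterated logarithm bounds of Eq. (66) to yield a finite $K_A$ such that for $k \ge K_A$,

$$ A_{n,k} \le 2\sigma_i \sqrt{2} (1 + \varepsilon) \sqrt{\frac{\ln\ln\left( \frac{1 - \varepsilon}{\mu^* - \mu_i} g(n_k) \right)}{\frac{1 - \varepsilon}{\mu^* - \mu_i} g(n_k)}}. \tag{87} $$

Similarly, there is a finite $K_B$ such that for $k \ge K_B$, observing that for sufficiently large $k$,
$T^{i^*}_\pi(n_k) \ge n_k/2$,

$$ B_k \le \frac{g(n_{k+1})}{g(n_k)} \left( \mu^* + \sigma_{i^*} \sqrt{2} (1 + \varepsilon) \sqrt{\frac{\ln\ln(n_k/2)}{n_k/2}} + \frac{g(n_k)}{n_k/2} \right) - \mu^*. \tag{88} $$

And finally, there is a finite $K_C$ such that for $k \ge K_C$,

$$ C_k \ge \frac{g(n_{k+1})}{g(n_k)} \left( \mu_i - \sigma_i \sqrt{2} (1 + \varepsilon) \sqrt{\frac{\ln\ln\left( \frac{1 - \varepsilon}{\mu^* - \mu_i} g(n_k) \right)}{\frac{1 - \varepsilon}{\mu^* - \mu_i} g(n_k)}} \right) - \mu_i. \tag{89} $$

Rearranging terms for space again, for $k \ge \max(K_A, K_B, K_C)$ we have

$$ \Delta(n)/T^i_\pi(n) \le A_{n,k} + B_k - C_k \le \tilde A_k + \tilde B_k + \tilde C_k + \tilde D_k, \tag{90} $$

where

$$
\begin{aligned}
\tilde A_k &= (\mu^* - \mu_i) \left( \frac{g(n_{k+1})}{g(n_k)} - 1 \right), \\
\tilde B_k &= \sigma_i \sqrt{2} (1 + \varepsilon) \left( 2 + \frac{g(n_{k+1})}{g(n_k)} \right) \sqrt{\frac{\ln\ln\left( \frac{1 - \varepsilon}{\mu^* - \mu_i} g(n_k) \right)}{\frac{1 - \varepsilon}{\mu^* - \mu_i} g(n_k)}}, \\
\tilde C_k &= \sigma_{i^*} \sqrt{2} (1 + \varepsilon) \frac{g(n_{k+1})}{g(n_k)} \sqrt{\frac{\ln\ln(n_k/2)}{n_k/2}}, \\
\tilde D_k &= \frac{g(n_{k+1})}{g(n_k)} \frac{g(n_k)}{n_k/2}.
\end{aligned}
\tag{91}
$$

Noting that each of the above are positive, we have from the preceding bound,

$$ \frac{\Delta(n)}{\sqrt{g(n) \ln\ln g(n)}} \le \frac{(\tilde A_k + \tilde B_k + \tilde C_k + \tilde D_k)\, T^i_\pi(n)}{\sqrt{g(n) \ln\ln g(n)}}. \tag{92} $$

Note that, applying Eq. (63) in this case, we have some finite $K_\varepsilon$ such that for $k \ge K_\varepsilon$,

$$ T^i_\pi(n) \le T^i_\pi(n_{k+1}) \le \frac{1 + \varepsilon}{\mu^* - \mu_i} g(n_{k+1}). \tag{93} $$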


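Recall from the proof of Prop. 3 that there is a finite $K'_\varepsilon$ such that for $k \ge K'_\varepsilon$,
$g(n_{k+1}) \le (1 + \varepsilon) g(n_k)$.
Noting too that $g(n_k) \le g(n)$, we have that for $k \ge \max(K_\varepsilon, K'_\varepsilon)$,

$$ \frac{\Delta(n)}{\sqrt{g(n) \ln\ln g(n)}} \le \frac{(\tilde A_k + \tilde B_k + \tilde C_k + \tilde D_k)\, g(n_k)}{\sqrt{g(n_k) \ln\ln g(n_k)}} \cdot \frac{(1 + \varepsilon)^2}{\mu^* - \mu_i}. \tag{94} $$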


We have

$$
\begin{aligned}
\frac{\tilde D_k\, g(n_k)}{\sqrt{g(n_k) \ln\ln g(n_k)}} &= \frac{g(n_{k+1})}{g(n_k)} \frac{g(n_k)}{n_k/2} \frac{g(n_k)}{\sqrt{g(n_k) \ln\ln g(n_k)}} \\
&\le 2 (1 + \varepsilon) \frac{g(n_k)^{3/2}}{n_k \sqrt{\ln\ln g(n_k)}} \\
&= o(1).
\end{aligned}
\tag{95}
$$

The last relationship follows, taking $g(n) = o(n^{2/3})$.
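We have

$$
\begin{aligned}
\frac{\tilde C_k\, g(n_k)}{\sqrt{g(n_k) \ln\ln g(n_k)}} &= 2\sigma_{i^*} (1 + \varepsilon) \frac{g(n_{k+1})}{g(n_k)} \sqrt{\frac{\ln\ln(n_k/2)\, g(n_k)}{n_k \ln\ln g(n_k)}} \\
&\le 2\sigma_{i^*} (1 + \varepsilon)^2 \sqrt{\frac{\ln\ln(n_k/2)\, g(n_k)}{n_k \ln\ln g(n_k)}} \\
&= o(1).
\end{aligned}
\tag{96}
$$

The last relationship follows, taking $g(n) = o(n/\ln\ln n)$.
We have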


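$$
\begin{aligned}
\frac{\tilde B_k\, g(n_k)}{\sqrt{g(n_k) \ln\ln g(n_k)}} &= \sigma_i \sqrt{2} (1 + \varepsilon) \left( 2 + \frac{g(n_{k+1})}{g(n_k)} \right) \sqrt{\frac{\ln\ln\left( \frac{1 - \varepsilon}{\mu^* - \mu_i} g(n_k) \right)}{\frac{1 - \varepsilon}{\mu^* - \mu_i} g(n_k)}} \sqrt{\frac{g(n_k)}{\ln\ln g(n_k)}} \\
&\le \frac{\sigma_i \sqrt{2} (1 + \varepsilon)(3 + \varepsilon)}{\sqrt{\frac{1 - \varepsilon}{\mu^* - \mu_i}}} \sqrt{\frac{\ln\ln\left( \frac{1 - \varepsilon}{\mu^* - \mu_i} g(n_k) \right)}{\ln\ln g(n_k)}} \\
&= \frac{\sigma_i \sqrt{2} (1 + \varepsilon)(3 + \varepsilon)}{\sqrt{\frac{1 - \varepsilon}{\mu^* - \mu_i}}} (1 + o(1)).
\end{aligned}
\tag{97}
$$

The last relationship follows, taking the $\{n_k\}_{k \ge 0}$ as infinite and unbounded, and $g$ as increasing and
unbounded.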


We have

$$ \frac{\tilde A_k\, g(n_k)}{\sqrt{g(n_k) \ln\ln g(n_k)}} = (\mu^* - \mu_i) \left( \frac{g(n_{k+1})}{g(n_k)} - 1 \right) \sqrt{\frac{g(n_k)}{\ln\ln g(n_k)}}. \tag{98} $$


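Let $\delta > 1$ be fixed. We use the bound here that for all positive $x \le 1 - 1/\delta$, $1/(1 - x) \le 1 + \delta x$.
Applying Eq. (58), we have for sufficiently large $k$,

$$
\begin{aligned}
\frac{g(n_{k+1})}{g(n_k)} - 1 &\le \frac{1 + \frac{g'(n_k)}{g(n_k)}}{1 - (1 + \varepsilon) P_\Delta g'(n_k)} - 1 \\
&\le \left( 1 + \frac{g'(n_k)}{g(n_k)} \right) \left( 1 + \delta (1 + \varepsilon) P_\Delta g'(n_k) \right) - 1 \\
&= g'(n_k) \left( \delta (1 + \varepsilon) P_\Delta + o(1) \right).
\end{aligned}
\tag{99}
$$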


The last relationship follows, as $g' \to 0$ and $g \to \infty$ with $n_k$. Applying this to the above bound,

$$
\begin{aligned}
\frac{\tilde A_k\, g(n_k)}{\sqrt{g(n_k) \ln\ln g(n_k)}} &\le (\mu^* - \mu_i) \left( \delta (1 + \varepsilon) P_\Delta + o(1) \right) g'(n_k) \sqrt{\frac{g(n_k)}{\ln\ln g(n_k)}} \\
&= o(1).
\end{aligned}
\tag{100}
$$

The last relationship follows, taking $g(n) = o(n^{2/3})$.
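
The elementary bound $1/(1 - x) \le 1 + \delta x$ for $0 < x \le 1 - 1/\delta$, used in Eq. (99), is equivalent to $\delta x (1 - 1/\delta - x) \ge 0$, and can be spot-checked with a short sketch. The value $\delta = 4$, the grid, and the name `bound_holds` are arbitrary choices for illustration; note that this $\delta > 1$ is unrelated to the $\delta < 1/2$ of Appendix A.

```python
def bound_holds(x, delta):
    """Check 1 / (1 - x) <= 1 + delta * x for a given x in (0, 1)."""
    return 1.0 / (1.0 - x) <= 1.0 + delta * x

delta = 4.0                                   # illustrative value, delta > 1
x_max = 1.0 - 1.0 / delta                     # right end of the admissible range
inside = [x_max * j / 1000 for j in range(1, 1000)]
ok_inside = all(bound_holds(x, delta) for x in inside)
print(ok_inside, bound_holds(0.9, delta))
```

The bound fails for $x$ beyond $1 - 1/\delta$ (for instance $x = 0.9$ here), which is why Eq. (99) is only claimed for sufficiently large $k$, once $(1 + \varepsilon) P_\Delta g'(n_k)$ is small enough.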

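Applying all of the above to the bound in Eq. (94), this yields

$$ \frac{\Delta(n)}{\sqrt{g(n) \ln\ln g(n)}} \le \left( \frac{\sigma_i \sqrt{2} (1 + \varepsilon)(3 + \varepsilon)}{\sqrt{\frac{1 - \varepsilon}{\mu^* - \mu_i}}} (1 + o(1)) + o(1) \right) \frac{(1 + \varepsilon)^2}{\mu^* - \mu_i}, \tag{101} $$

or

$$ \limsup_n \frac{\Delta(n)}{\sqrt{g(n) \ln\ln g(n)}} \le \left( \frac{\sigma_i \sqrt{2} (1 + \varepsilon)(3 + \varepsilon)}{\sqrt{(1 - \varepsilon)(\mu^* - \mu_i)}} \right) (1 + \varepsilon)^2. \tag{102} $$

Taking the limit as $\varepsilon \to 0$ completes the proof,

$$ \limsup_n \frac{\Delta(n)}{\sqrt{g(n) \ln\ln g(n)}} \le \frac{3 \sigma_i \sqrt{2}}{\sqrt{\mu^* - \mu_i}}. \tag{103} $$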